Use filter to select observations (rows) from a tibble.

Logical conditions

In the previous section we learned how to select variables (columns) with all its values using the select function. In this section we are interested in a collection of observations (rows). For this we have the function filter which takes as its first argument the tibble followed by a logical condition describing the variable values constraints.

In the context of filter function a logical condition is a way to express whether some statement about an observation is true or false, if true then that observation is returned otherwise it is ignored. Note that the condition is applied to all observation in the tibble.

For example, let’s say we want to know which individuals in our pulse dataset are taller than 190 cm? You can achieve this with filter as follows:

filter(pulse, height>190) # filter all observations with height greater than 190 cm 
# A tibble: 5 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1993_D Travis    195     84    18 male   no     yes     high    sat       71     73
2 1997_K John      191     78    19 male   no     yes     high    ran       68    136
3 1997_O Albert    194    110    25 male   no     no      modera… sat       75     75
4 1997_Q Lance     192    105    21 male   no     no      modera… sat       80     73
5 1997_R Chris…    194     95    18 male   no     yes     modera… ran       84    140
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

or another example would be, who’s weight is exactly 55 kg?

filter(pulse, weight==55) # filter all observations with weight equal to 55 kg 
# A tibble: 5 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1993_H Franc…    169     55    18 female no     yes     modera… sat       71     77
2 1993_W Glenna    163     55    20 female no     no      low     sat       78     74
3 1995_J Gwen      163     55    20 female no     no      modera… ran       70    119
4 1998_G Ursula    155     55    20 female no     yes     high    sat       82     87
5 1998_I Elean…    168     55    24 female no     no      modera… sat       74     70
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

filter uses logical vectors in the background to achieve the result. You’ve already seen an example of a logical vector in the section on Vectors. The result of a comparison with relational operators {==, !=, <, <=, >, >=} is a logical vector.

‘==’ stands for ‘equal to’ and ‘!=’ is ‘not equal to’

In the examples above the comparisons height>190 and weight==55 are in fact logical vectors with the same length as the number of observations (rows) in the pulse dataset. The function filter takes the logical vector and returns only the rows in positions for which the value is TRUE.

Multiple conditions : and (‘&’), or (‘|’)

In the previous examples with filter we only had a single condition. We can expand the conditions to ask more complex questions. For example which males weigh less than 70? Here we have two conditions:

And : ‘&’

Both these conditions must be met, with other words for each observation gender=="male" and weight<70 must be true and this is expressed with the symbol & :

filter(pulse, gender=="male" & weight<70)
# A tibble: 24 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
 2 1993_L Fred…    178     58    19 male   no     no      low     sat       74     76
 3 1993_M Just…    170     68    22 male   yes    yes     modera… sat       70     71
 4 1993_N Erne…    187     59    18 male   no     yes     high    sat       78     82
 5 1993_Q Lesl…    170     56    19 male   no     no      low     sat       64     63
 6 1993_U Jero…    175     60    19 male   no     no      low     sat       88     86
 7 1993_Y John     176     59    19 male   no     no      modera… sat       68     69
 8 1995_A Pedro    170     60    18 male   no     yes     modera… sat       62     59
 9 1995_C Oscar    189     60    19 male   no     yes     modera… ran       78    168
10 1995_I Marc     175     65    19 male   no     yes     modera… ran       60    104
# … with 14 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

Another example: select females with high frequency of exercise:

filter(pulse, gender=="female" & exercise=="high")
# A tibble: 3 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1993_C Consu…    167     62    18 female no     yes     high    ran       96    176
2 1993_I Sonja     164     56    19 female no     yes     high    sat       68     68
3 1998_G Ursula    155     55    20 female no     yes     high    sat       82     87
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

Or : ‘|’

The or conditional operator is true when at least one of the conditions is true. For example, list all rows with individuals who drink or smoke or both:

filter(pulse, alcohol=="yes" | smokes=="yes")
# A tibble: 70 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 2 1993_B Mela…    179     58    19 female no     yes     modera… ran       82    150
 3 1993_C Cons…    167     62    18 female no     yes     high    ran       96    176
 4 1993_D Trav…    195     84    18 male   no     yes     high    sat       71     73
 5 1993_E Lauri    173     64    18 female no     yes     low     sat       90     88
 6 1993_F Geor…    184     74    22 male   no     yes     low     ran       78    141
 7 1993_G Cher…    162     57    20 female no     yes     modera… sat       68     72
 8 1993_H Fran…    169     55    18 female no     yes     modera… sat       71     77
 9 1993_I Sonja    164     56    19 female no     yes     high    sat       68     68
10 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
# … with 60 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

There are 70 observations who at least do one of the two: drink or smoke.

Which group is missing from the result above?

Those that neither smoke nor drink.


Another example, filter rows from years 1995 and 1997

filter(pulse, year==1995 | year==1997)
# A tibble: 45 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1995_A Pedro    170     60    18 male   no     yes     modera… sat       62     59
 2 1995_B Olga     172     60    21 female no     no      low     sat       81     79
 3 1995_C Oscar    189     60    19 male   no     yes     modera… ran       78    168
 4 1995_D Kelli    178     56    21 female no     yes     modera… ran       86    150
 5 1995_E Scott    175     75    20 male   no     yes     modera… ran       59     92
 6 1995_F Bobby    180     85    19 male   yes    yes     modera… ran       68    125
 7 1995_G Laur…    160     57    19 female no     no      modera… ran       75    130
 8 1995_H Eliza    164     66    23 female no     no      low     ran       74    168
 9 1995_I Marc     175     65    19 male   no     yes     modera… ran       60    104
10 1995_J Gwen     163     55    20 female no     no      modera… ran       70    119
# … with 35 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

Here we have the same variable year in our | construct.

What is wrong with this command: filter(pulse, year==1995 & year==1997) ?

An observation in a variable can not take two values at the same time.


Helper functions

is.na()

You have already seen is.na in action with vectors. It can be used in the context of filter as a logical condition to test for missing value: NA.

Take for example the pulse data, it has missing values in pulse1 and pulse2. We can find the row(s) with missing value for pulse1 with:

filter(pulse, is.na(pulse1))
# A tibble: 1 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1997_G Camil…    173     64    20 female no     yes     modera… sat       NA     NA
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

We can see there is a single row with missing value pulse1. What about pulse2?

filter(pulse, is.na(pulse2))
# A tibble: 1 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1997_G Camil…    173     64    20 female no     yes     modera… sat       NA     NA
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

Also a single row and it looks like to be the same row as the one with missing value for pulse1

Can you think of a logical condition that confirms that the missing pulse1 and pulse2 values are actually in the same observation?

filter(pulse, is.na(pulse1) | is.na(pulse2)) results in a single row.


Equality test with NA, i.e == NA, always results in NA. Therefore the convinient function is.na().

Negation operator: !

If you are interested only in observation that do not contain missing values, in this case in pulse1 and pulse2:

filter(pulse, ! is.na(pulse1))
# A tibble: 109 × 13
   id     name  height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
   <chr>  <chr>  <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
 1 1993_A Bonn…    173     57    18 female no     yes     modera… sat       86     88
 2 1993_B Mela…    179     58    19 female no     yes     modera… ran       82    150
 3 1993_C Cons…    167     62    18 female no     yes     high    ran       96    176
 4 1993_D Trav…    195     84    18 male   no     yes     high    sat       71     73
 5 1993_E Lauri    173     64    18 female no     yes     low     sat       90     88
 6 1993_F Geor…    184     74    22 male   no     yes     low     ran       78    141
 7 1993_G Cher…    162     57    20 female no     yes     modera… sat       68     72
 8 1993_H Fran…    169     55    18 female no     yes     modera… sat       71     77
 9 1993_I Sonja    164     56    19 female no     yes     high    sat       68     68
10 1993_J Troy     168     60    23 male   no     yes     modera… ran       88    150
# … with 99 more rows, 1 more variable: year <dbl>, and abbreviated variable name
#   ¹​exercise

between(x, left, right)

With between function it is possible to filter rows based on variable intervals:

# filter rows with height >=165 and height <= 166
filter(pulse, between(height, 165, 166))
# A tibble: 6 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1993_S Nicole    166     56    21 female yes    no      modera… sat       83     79
2 1995_M Desir…    165     63    18 female no     yes     modera… sat       84     84
3 1996_R Lucy      166     50    19 female no     yes     low     ran       76    132
4 1997_E Julia…    165     48    19 female no     no      low     sat       83     84
5 1998_C Dona      165     58    23 female no     yes     low     sat       64     68
6 1998_H Haley     165     60    19 female yes    yes     low     ran       88    120
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

Note that example above can be explicitly written down with & :

filter(pulse, height >=  165 &  height <= 166 )
# A tibble: 6 × 13
  id     name   height weight   age gender smokes alcohol exerc…¹ ran   pulse1 pulse2
  <chr>  <chr>   <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>   <chr>  <dbl>  <dbl>
1 1993_S Nicole    166     56    21 female yes    no      modera… sat       83     79
2 1995_M Desir…    165     63    18 female no     yes     modera… sat       84     84
3 1996_R Lucy      166     50    19 female no     yes     low     ran       76    132
4 1997_E Julia…    165     48    19 female no     no      low     sat       83     84
5 1998_C Dona      165     58    23 female no     yes     low     sat       64     68
6 1998_H Haley     165     60    19 female yes    yes     low     ran       88    120
# … with 1 more variable: year <dbl>, and abbreviated variable name ¹​exercise

Both are valid solutions.



Copyright © 2023 Biomedical Data Sciences (BDS) | LUMC